Because one of the key issues in improving the performance of Speech Emotion Recognition (SER) systems is the choice of an effective feature representation, most research has focused on feature-level fusion of large feature sets. In our study, we propose a relatively low-dimensional feature set that combines three kinds of features: baseline Mel Frequency Cepstral Coefficients (MFCCs), MFCCs derived from Discrete Wavelet Transform (DWT) sub-band coefficients, denoted DMFCC, and pitch-based features. Moreover, the performance of the proposed feature extraction method is evaluated both in clean conditions and in the presence of several real-world noises. Furthermore, conventional Machine Learning (ML) and Deep Learning (DL) classifiers are employed for comparison. The proposal is tested on speech utterances from both the Berlin German Emotional Database (EMO-DB) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database in speaker-independent experiments. Experimental results show improvement in speech emotion detection over the baselines.
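To make the fused representation concrete, the following is a minimal sketch of the three-way feature combination described above, assuming the librosa and PyWavelets libraries. The function name, the wavelet family ('db4'), the single-level decomposition, the frame-averaging step, and the pitch statistics are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import librosa
import pywt

def extract_fused_features(signal, sr, n_mfcc=13):
    """Concatenate MFCC, DWT-based MFCC (DMFCC), and pitch features
    into one low-dimensional vector (a sketch, not the paper's exact setup)."""
    # Baseline MFCCs, averaged over frames to obtain a fixed-length vector.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    mfcc_vec = mfcc.mean(axis=1)

    # DMFCC: MFCC-style features computed on a DWT sub-band. Here only the
    # level-1 approximation band is used; wavelet and level are assumptions.
    approx, _detail = pywt.dwt(signal, 'db4')
    dmfcc = librosa.feature.mfcc(y=approx.astype(np.float32),
                                 sr=sr // 2, n_mfcc=n_mfcc)
    dmfcc_vec = dmfcc.mean(axis=1)

    # Pitch-based features: simple statistics of the F0 contour via pYIN.
    f0, _voiced, _prob = librosa.pyin(signal,
                                      fmin=librosa.note_to_hz('C2'),
                                      fmax=librosa.note_to_hz('C7'),
                                      sr=sr)
    f0 = f0[~np.isnan(f0)]
    pitch_vec = (np.array([f0.mean(), f0.std()])
                 if f0.size else np.zeros(2))

    # Feature-level fusion: concatenation yields a 2 * n_mfcc + 2 vector.
    return np.concatenate([mfcc_vec, dmfcc_vec, pitch_vec])
```

The resulting vector can then be fed to any of the ML or DL classifiers used for comparison; the point of the sketch is that the fusion stays low-dimensional (here 28 values for n_mfcc = 13) relative to large fused feature sets.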